Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next-token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models of up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar-scale diffusion models and language models, reaping the benefits of both worlds.
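The core idea, a single transformer trained with a language-modeling loss on discrete text tokens and a diffusion (noise-prediction) loss on continuous image patches, can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the model interface, the noise schedule handling, and the loss weighting (`diffusion_weight`) are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def transfusion_loss(model, text_tokens, image_patches, alphas_bar, t,
                     diffusion_weight=1.0):
    """Sketch of a combined objective on one mixed-modality batch.

    text_tokens   : (B, S) int64 token ids
    image_patches : (B, P, D) continuous patch latents
    alphas_bar    : (T,) cumulative noise-schedule coefficients
    t             : (B,) sampled diffusion timesteps
    """
    # Forward-diffuse the continuous patches: x_t = sqrt(a)*x_0 + sqrt(1-a)*eps.
    eps = torch.randn_like(image_patches)
    a = alphas_bar[t].view(-1, 1, 1)
    noisy_patches = a.sqrt() * image_patches + (1 - a).sqrt() * eps

    # One shared transformer processes the full mixed sequence and returns
    # next-token logits at text positions and a noise estimate at image
    # positions (this interface is assumed for the sketch).
    text_logits, eps_pred = model(text_tokens, noisy_patches, t)

    # Language-modeling loss: next-token prediction on discrete text.
    lm_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        text_tokens[:, 1:].reshape(-1),
    )
    # Diffusion loss: predict the noise added to the image patches.
    diffusion_loss = F.mse_loss(eps_pred, eps)
    return lm_loss + diffusion_weight * diffusion_loss
```

Because both losses are computed from one forward pass over the same sequence, the transformer's parameters are shared across modalities; only the per-position loss differs by whether the position holds a text token or an image patch.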